Existing methods of multiple human parsing usually adopt two-stage strategies (typically top-down or bottom-up), which suffer from either strong dependence on prior detection or highly computational redundancy during post-grouping. In this work, we present an end-to-end multiple human parsing framework using representative parts, termed RepParser. Different from mainstream methods, RepParser solves multiple human parsing in a new single-stage manner without resorting to person detection or post-grouping. To this end, RepParser decouples the parsing pipeline into instance-aware kernel generation and part-aware human parsing, which are responsible for instance separation and instance-specific part segmentation, respectively. In particular, we empower the parsing pipeline with representative parts, since they are characterized by instance-aware keypoints and can be used to dynamically parse each person instance. Specifically, representative parts are obtained by jointly localizing instance centers and estimating keypoints of body part regions. After that, we dynamically predict instance-aware convolution kernels through the representative parts, thus encoding person-part contexts into each kernel, which is responsible for casting an image feature as an instance-specific representation. Furthermore, a multi-branch structure is adopted to divide each instance-specific representation into several part-aware representations for separate part segmentation. In this way, RepParser focuses on person instances under the guidance of representative parts and directly outputs parsing results for each person instance, thus eliminating the requirement of prior detection or post-grouping. Extensive experiments on two challenging benchmarks demonstrate that our proposed RepParser is a simple yet effective framework and achieves competitive performance.
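The idea of instance-aware dynamic kernels can be illustrated with a minimal sketch in plain Python. All shapes, the linear kernel generator, and the 1x1-convolution simplification are illustrative assumptions, not the paper's actual layers:

```python
# Toy sketch of instance-aware dynamic kernels (illustrative only).
# A "representative part" embedding for each person instance is mapped
# to a 1x1 convolution kernel; applying that kernel to the shared
# feature map yields an instance-specific response map.

def generate_kernel(part_embedding, weight):
    # Linear kernel generator: kernel[c] = sum_d W[c][d] * e[d]
    return [sum(w * e for w, e in zip(row, part_embedding)) for row in weight]

def apply_kernel(feature_map, kernel):
    # feature_map: H x W x C; a 1x1 conv is a per-pixel dot product
    return [[sum(k * f for k, f in zip(kernel, px)) for px in row]
            for row in feature_map]

# Two instances share one feature map but get different kernels.
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # maps 2-d embedding -> 3-d kernel
feat = [[[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]]    # 1 x 2 x 3 feature map
k_a = generate_kernel([1.0, 0.0], weight)       # instance A
k_b = generate_kernel([0.0, 1.0], weight)       # instance B
resp_a = apply_kernel(feat, k_a)
resp_b = apply_kernel(feat, k_b)
```

The point of the sketch: a single shared feature map produces different instance-specific responses purely because the kernels differ per instance.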
Skeleton-based action recognition aims to project skeleton sequences to action categories, where skeleton sequences are derived from multiple forms of pre-detected points. Compared with earlier methods, which focus on exploring single-form skeletons via Graph Convolutional Networks (GCNs), existing methods tend to improve GCNs by leveraging multi-form skeletons with complementary cues. However, these methods (either adapting the structure of GCNs or using model ensembles) require the co-existence of all forms of skeletons during both the training and inference stages, whereas a typical real-life situation is that only partial forms exist for inference. To tackle this issue, we present Adaptive Cross-Form Learning (ACFL), which empowers well-designed GCNs to generate complementary representations from single-form skeletons without changing model capacity. Specifically, each GCN model in ACFL not only learns action representations from its single-form skeletons, but also adaptively mimics useful representations derived from other forms of skeletons. In this way, each GCN learns how to strengthen what has been learned, thus exploiting model potential and facilitating action recognition. Extensive experiments on three challenging benchmarks, i.e., NTU-RGB+D 120, NTU-RGB+D 60, and UAV-Human, demonstrate the effectiveness and generalizability of the proposed method. Specifically, ACFL significantly improves various GCN models (i.e., CTR-GCN, MS-G3D, and Shift-GCN), achieving a new record for skeleton-based action recognition.
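The cross-form mimicry can be seen as an auxiliary term added to each single-form model's loss. A minimal sketch, assuming a mean-squared mimic term and a weighting factor `alpha` (the paper's exact loss form may differ):

```python
# Toy sketch of cross-form mimicry (illustrative; not ACFL's exact loss).
# Each single-form model keeps its own task loss and adds a term pulling
# its representation toward representations derived from other skeleton forms.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_form_loss(task_loss, own_repr, other_reprs, alpha=0.5):
    # average the mimic terms over all other-form representations
    mimic = sum(mse(own_repr, r) for r in other_reprs) / len(other_reprs)
    return task_loss + alpha * mimic

own = [0.2, 0.4]
others = [[0.0, 0.4], [0.4, 0.4]]   # representations from two other forms
loss = cross_form_loss(1.0, own, others, alpha=0.5)
```

Because the mimic term only touches the loss, the single-form model can be used alone at inference time, matching the motivation of only partial forms existing in deployment.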
Part-level attribute parsing is a fundamental but challenging task, which requires region-level visual understanding to provide explainable details of body parts. Most existing approaches address this problem by adding an attribute prediction head to a two-stage detector based on regional convolutional neural networks (RCNN), in which the attributes of body parts are identified from local part boxes. However, local part boxes with limited visual clues (i.e., part appearance only) lead to unsatisfying parsing results, since the attributes of body parts are highly dependent on comprehensive relations among them. In this paper, we propose a Knowledge Embedded RCNN (KE-RCNN) to identify attributes by leveraging rich knowledge, including implicit knowledge (e.g., the attribute "above-the-hip" for a shirt requires visual/geometry relations of shirt-hip) and explicit knowledge (e.g., a part of "shorts" cannot have the attribute of "hoodie" or "lining"). Specifically, the KE-RCNN consists of two novel components, i.e., an Implicit Knowledge based Encoder (IK-En) and an Explicit Knowledge based Decoder (EK-De). The former is designed to enhance part-level representations by encoding part-part relational contexts into part boxes, and the latter is proposed to decode attributes with the guidance of prior knowledge about part-attribute relations. In this way, the KE-RCNN is plug-and-play and can be integrated into any two-stage detector, e.g., Attribute-RCNN, Cascade-RCNN, HRNet-based RCNN, and SwinTransformer-based RCNN. Extensive experiments on two challenging benchmarks, i.e., Fashionpedia and Kinetics-TPS, demonstrate the effectiveness and generalizability of the KE-RCNN. In particular, it achieves consistent improvements over all existing methods, reaching around 3% of AP on Fashionpedia and around 4% of Acc on Kinetics-TPS.
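The "explicit knowledge" can be thought of as hard compatibility constraints between parts and attributes. A minimal sketch, where the compatibility table and all part/attribute names are hypothetical examples rather than the KE-RCNN implementation:

```python
# Toy illustration of explicit part-attribute knowledge as a
# compatibility table that masks out impossible attribute scores.
# (Part and attribute names here are hypothetical examples.)

COMPATIBLE = {
    "shorts": {"denim", "lining-none"},
    "shirt":  {"hoodie", "lining", "denim"},
}

def mask_scores(part, scores):
    # zero out attributes the prior says this part cannot have
    allowed = COMPATIBLE[part]
    return {a: (s if a in allowed else 0.0) for a, s in scores.items()}

raw = {"hoodie": 0.9, "denim": 0.7, "lining": 0.4, "lining-none": 0.1}
out = mask_scores("shorts", raw)
```

Even this crude masking shows why prior knowledge helps: a confident but impossible prediction ("hoodie" on shorts) is suppressed regardless of its raw score.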
Human densepose estimation, which aims at establishing dense correspondences between 2D pixels of the human body and a 3D human body template, is a key technique for enabling machines to understand people in images. It still poses several challenges because practical scenes are complex and only partial annotations are available, leading to incomplete or false estimations. In this work, we present a novel framework to detect the densepose of multiple people in an image. The proposed method, which we refer to as Knowledge Transfer Network (KTN), tackles two main problems: 1) how to refine the image representation to alleviate incomplete estimations, and 2) how to reduce false estimations caused by low-quality training labels (i.e., limited annotations and class-imbalanced labels). Unlike existing works that directly propagate the pyramidal features of regions for densepose estimation, the KTN performs a refinement of the pyramidal representation, in which it maintains feature resolution and suppresses background pixels, and this strategy leads to a substantial increase in accuracy. Moreover, the KTN enhances the ability of 3D-based body parsing with external knowledge, in which it casts 2D-based body parsers trained from sufficient annotations as a 3D-based body parser through a structural body knowledge graph. In this way, it significantly reduces the adverse effects caused by low-quality annotations. The effectiveness of the KTN is demonstrated by its superior performance over state-of-the-art methods on the DensePose-COCO dataset. Extensive ablation studies and experimental results on representative tasks (e.g., human body segmentation, human part segmentation, and keypoint detection) and two popular densepose estimation pipelines (i.e., RCNN and fully-convolutional frameworks) further indicate the generalizability of the proposed method.
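The knowledge-transfer step can be caricatured in plain Python: a hand-written association table stands in for the structural body knowledge graph and lifts 2D part scores onto 3D surface parts. The table entries and all names below are purely hypothetical, not the paper's actual graph:

```python
# Toy sketch of casting 2D part predictions to 3D surface parts via a
# part-to-surface association table (a stand-in for the body knowledge
# graph; all names are hypothetical).

ASSOC = {
    "left_arm": ["left_upper_arm_surface", "left_lower_arm_surface"],
    "torso":    ["torso_surface"],
}

def lift_to_surfaces(part_scores):
    # spread each 2D part score evenly over its associated 3D surfaces
    surface_scores = {}
    for part, score in part_scores.items():
        surfaces = ASSOC[part]
        for s in surfaces:
            surface_scores[s] = surface_scores.get(s, 0.0) + score / len(surfaces)
    return surface_scores

out = lift_to_surfaces({"left_arm": 0.8, "torso": 0.6})
```

The design motivation carries over from the abstract: well-annotated 2D part labels can supervise 3D surface parsing once a structural mapping between the two label spaces is available.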
Part-level action parsing targets part-state parsing for boosting action recognition in videos. Despite dramatic progress in the area of video classification research, a severe problem faced by the community is that the detailed understanding of human actions is ignored. Our motivation is that parsing human actions requires building models that focus on each specific sub-problem. We present a simple yet effective approach, named decomposed action parsing (DAP). Specifically, we divide part-level action parsing into three stages: 1) person detection, where a person detector is adopted to detect all persons from videos and to perform instance-level action recognition; 2) part parsing, where a part-parsing model is proposed to recognize human parts from the detected person images; and 3) action parsing, where a multi-modal action parsing network is used to parse the action category, conditioned on all detection results obtained from the previous stages. By applying these three models together, our DAP approach records a global mean score of 0.605 in the 2021 Kinetics-TPS Challenge.
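The three-stage decomposition can be sketched as a straightforward pipeline of stub models. Everything below (function names, outputs, the fusion rule) is a placeholder structure, not the real detectors or networks:

```python
# Sketch of the three-stage DAP-style pipeline with stub models
# (all functions and outputs are placeholders, not the real system).

def detect_persons(video):
    # stage 1: person boxes plus instance-level action labels
    return [{"box": (0, 0, 10, 20), "action": "run"}]

def parse_parts(person):
    # stage 2: human parts inside each detected person
    return [{"part": "leg", "state": "moving"}]

def parse_action(persons, parts):
    # stage 3: fuse all earlier detections into a final action category
    return persons[0]["action"] if parts else "unknown"

def dap(video):
    persons = detect_persons(video)
    parts = [p for person in persons for p in parse_parts(person)]
    return parse_action(persons, parts)

result = dap("toy_video")
```

The appeal of this structure is that each stage can be trained, swapped, or ensembled independently, which matches the abstract's motivation of focusing a dedicated model on each sub-problem.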
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR input is missing. Code will be released at https://github.com/junjie18/CMT.
Knowledge graphs (KG) have served as the key component of various natural language processing applications. Commonsense knowledge graphs (CKG) are a special type of KG, where entities and relations are composed of free-form text. However, previous works in KG completion and CKG completion suffer from long-tail relations and newly-added relations which do not have many known triples for training. In light of this, few-shot KG completion (FKGC), which requires the strengths of graph representation learning and few-shot learning, has been proposed to address the problem of limited annotated data. In this paper, we comprehensively survey previous attempts on such tasks in the form of a series of methods and applications. Specifically, we first introduce FKGC challenges, commonly used KGs, and CKGs. Then we systematically categorize and summarize existing works in terms of the type of KGs and the methods. Finally, we present applications of FKGC models on prediction tasks in different areas and share our thoughts on future research directions of FKGC.
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold: First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., the feature level and the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modifications. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
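The first insight, mask-guided class centers, reduces to masked average pooling followed by a re-weighting step. A minimal sketch, assuming a simple channel-wise product as the re-weighting (the paper's actual weighting module is more elaborate):

```python
# Toy sketch of mask-based dynamic class centers (illustrative only).
# Support features are pooled only where the support mask is 1, and the
# resulting class center re-weights query features channel-wise.

def class_center(support_feats, mask):
    # support_feats: one C-dim vector per pixel; mask: 0/1 per pixel
    kept = [f for f, m in zip(support_feats, mask) if m == 1]
    n = len(kept)
    return [sum(f[c] for f in kept) / n for c in range(len(kept[0]))]

def reweight(query_feat, center):
    # channel-wise product: channels strong in the support class
    # are amplified in the query feature
    return [q * c for q, c in zip(query_feat, center)]

support = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
mask = [1, 1, 0]                       # third pixel is background
center = class_center(support, mask)
query = [0.5, 2.0]
out = reweight(query, center)
```

Pooling under the mask is what makes the center "dynamic": background pixels in the support image never contaminate the class statistics.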
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias of the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
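The tension between distillation and fairness can be made concrete with a generic bias-aware objective. This is a minimal sketch of the general idea, not RELIANT's actual objective: a distillation term pulls the student toward the teacher, while a fairness term penalizes the gap in mean positive score between two sensitive groups:

```python
# Generic sketch of bias-aware distillation (illustrative; not the
# RELIANT objective). The fairness term is a demographic-parity-style
# gap between the mean scores of two sensitive groups.

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def parity_gap(preds, groups):
    g0 = [p for p, g in zip(preds, groups) if g == 0]
    g1 = [p for p, g in zip(preds, groups) if g == 1]
    return abs(sum(g0) / len(g0) - sum(g1) / len(g1))

def fair_kd_loss(student, teacher, groups, lam=1.0):
    # lam trades off fidelity to the teacher against group fairness
    return mse(student, teacher) + lam * parity_gap(student, groups)

student = [0.8, 0.2, 0.6, 0.4]
teacher = [0.9, 0.1, 0.7, 0.3]
groups  = [0, 0, 1, 1]
loss = fair_kd_loss(student, teacher, groups, lam=1.0)
```

With `lam = 0` this reduces to plain distillation, which is exactly the regime where the student can inherit and amplify the teacher's bias.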
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even though the same framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing \textbf{SoTA} CNN-/Transformer-based models, while trading off model accuracy and efficiency well.
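The short-distance/long-distance pairing inside one residual block can be caricatured in 1-D plain Python. This is a loose illustration of the general pattern only; the real iRMB uses depth-wise convolution and attention, not the stand-ins below:

```python
# A loose 1-D caricature of combining short- and long-distance mixing
# inside one residual block (illustrative only; not the actual iRMB).

def local_mix(x):
    # short-distance: average each value with its immediate neighbors
    # (a stand-in for depth-wise convolution)
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3
            for i in range(n)]

def global_mix(x):
    # long-distance: blend every position with the sequence mean
    # (a stand-in for attention)
    mean = sum(x) / len(x)
    return [(v + mean) / 2 for v in x]

def block(x):
    mixed = global_mix(local_mix(x))
    return [a + b for a, b in zip(x, mixed)]   # residual connection

out = block([0.0, 3.0, 0.0, 3.0])
```

The residual connection preserves the input signal while the two mixing steps inject neighborhood and sequence-wide context, which is the structural pattern the block family shares.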